Data Validation in Excel and R

library(tidyverse)
library(visdat)
library(pointblank)
library(skimr)
Overview
There are many opportunities for human or instrument error to affect data. Ideally, you want to find those errors and fix them early and often! This workshop introduces some tools in Excel and R to avoid making mistakes in data entry and data collection, and to detect the ones you inadvertently make.
Learning Objectives
Understand best practices for entering data and fixing errors in data
Use Excel data validation tools to prevent data entry errors
Compare data entered by two people in R to check for data entry mistakes
Explore data summaries to check for errors
Get the gist of how you can use the pointblank package to perform data validation checks
Software needed
You’ll need access to Excel, R, RStudio, and the following R packages:
If you can, install the development version of pointblank using this R code:
if(!require("remotes")) {
install.packages("remotes")
}
remotes::install_github("rich-iannone/pointblank")

Data Entry in Spreadsheets
A common and useful way to enter data is in a spreadsheet (e.g. in Excel or Google Sheets). Here are some tips for avoiding data entry mistakes and saving yourself time and headaches with data wrangling:
- For categorical variables and notes use consistent values (e.g. don’t sometimes write “dead” and other times write “Dead”)
- Explicitly record missing data instead of leaving cells blank
- Use consistent, concise, descriptive, machine-readable column headers (this is hard!)
- Use ISO (YYYY-MM-DD) format for dates
For more detailed guidelines on using spreadsheets for data entry, I highly recommend Data Organization in Spreadsheets by Broman and Woo (2017).
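As a quick illustration of the first tip, here is a minimal sketch (with made-up plant status values) of how inconsistent categorical entries surface in R, and how they can be normalized:

```r
library(tidyverse)

# made-up records where one category ("dead") was typed three different ways
plants <- tibble(status = c("dead", "Dead", "alive", "alive", "DEAD"))

# count() exposes the inconsistency: four "categories" instead of two
count(plants, status)

# normalizing case collapses the spellings back to a single value
plants |>
  mutate(status = str_to_lower(status)) |>
  count(status)
```

Cleaning like this is possible after the fact, but entering consistent values in the first place is far less error-prone.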
Avoiding mistakes in data entry
Set up validation tools in your data entry spreadsheet to stop data entry errors in their tracks!
Data Validation Tools in Excel
- Select a column (or cells) and choose Data > Validation… from the menu
- Use “list” to restrict to specific values for categorical data
- Use “whole number” for count data
- Use “date” to restrict date ranges
- Validation can also be set up after data entry: highlight invalid data with “Circle Invalid Data” from the toolbar
Watch out for Excel autocorrect!
To stop Excel from converting entries to dates:
- Explicitly set all column types to numeric, text, date, etc.
- Make sure no columns are set to “general”
Double-entry Method
Two people enter the same data, then compare the two copies programmatically.
In the data folder, there are two versions of a dataset: one entered by Eric and one entered by Jessica.

eric <- read_csv("data/tea_eric.csv")

Rows: 60 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): field, counter
dbl (10): plant_id, shoot_1, shoot_2, shoot_3, shoot_4, shoot_5, shoot_6, s...
date (1): date
time (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jessica <- read_csv("data/tea_jessica.csv")

Rows: 60 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): field, counter
dbl (10): plant_id, shoot_1, shoot_2, shoot_3, shoot_4, shoot_5, shoot_6, s...
date (1): date
time (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This example dataset is modified from Scott et al. (2021). Each day, researchers counted the number of leafhoppers (hoppers) on tea plants and measured the lengths of 7 labeled tea shoots in cm (recorded in the shoot_* columns). leaves is the number of leaves searched for leafhoppers on each plant, and counter is the initial of the researcher who took the measurements.
Compare visually with visdat
We can compare them a couple of ways. First, we can compare them visually using the visdat package. This only works if the two datasets are the same dimensions.
vis_compare(eric, jessica)

Compare with dplyr::anti_join()
First add row numbers to make it easier to find mistakes in Excel.
# add rownumbers that match Excel (headers are row 1)
eric <- eric |> mutate(row = 2:(n()+1), .before = date)
jessica <- jessica |> mutate(row = 2:(n()+1), .before = date)

anti_join() takes two data frames and returns the rows of the first that have no exact match in the second.
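Before running it on the real data, here is a minimal sketch of what anti_join() returns, using two toy tables (made up for illustration) that differ in a single value:

```r
library(dplyr)

# two copies of the same three-row table; one value was mistyped in `b`
a <- tibble(id = 1:3, x = c(10, 20, 30))
b <- tibble(id = 1:3, x = c(10, 25, 30))

# rows in `a` with no exact match in `b`: only the mistyped row comes back
anti_join(a, b, by = c("id", "x"))
```

Matching rows vanish from the result, so anything that remains is a discrepancy worth investigating.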
#values in `eric` that are different in `jessica`
anti_join(eric, jessica)

Joining, by = c("row", "date", "field", "time", "plant_id", "shoot_1",
"shoot_2", "shoot_3", "shoot_4", "shoot_5", "shoot_6", "shoot_7", "hoppers",
"counter", "leaves")
# A tibble: 6 × 15
row date field time plant_id shoot_1 shoot_2 shoot_3 shoot_4 shoot_5
<int> <date> <chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 2016-07-01 A 06:19 10 6.4 2.5 3.8 1 4.6
2 28 2016-07-02 A 06:26 7 3.5 2.7 2.4 2.2 2.4
3 29 2016-07-02 A 06:26 8 8.1 3.8 1.8 6 5.7
4 49 2016-07-03 A 06:11 8 8.5 4.1 2.1 6.4 5.7
5 51 2016-07-03 A 06:11 10 7.9 3.2 5.4 8.6 5.6
6 61 2016-07-03 B 06:31 NA NA NA NA NA NA
# … with 5 more variables: shoot_6 <dbl>, shoot_7 <dbl>, hoppers <dbl>,
# counter <chr>, leaves <dbl>
#values in `jessica` that are different in `eric`
anti_join(jessica, eric)

Joining, by = c("row", "date", "field", "time", "plant_id", "shoot_1",
"shoot_2", "shoot_3", "shoot_4", "shoot_5", "shoot_6", "shoot_7", "hoppers",
"counter", "leaves")
# A tibble: 6 × 15
row date field time plant_id shoot_1 shoot_2 shoot_3 shoot_4 shoot_5
<int> <date> <chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 2016-07-01 A 06:19 10 6.4 2.5 3.8 7 4.6
2 28 2016-07-02 A 06:26 7 3.5 2.7 2.4 2.2 2.4
3 29 2016-07-02 A 06:26 8 8.1 3.8 1.8 6 5.7
4 49 2016-07-03 A 06:11 8 8.5 4.1 21 6.4 5.7
5 51 2016-07-03 A 06:11 11 7.9 3.2 5.4 8.6 5.6
6 61 2016-07-03 B 06:31 20 5.8 6.8 16.6 NA 5.8
# … with 5 more variables: shoot_6 <dbl>, shoot_7 <dbl>, hoppers <dbl>,
# counter <chr>, leaves <dbl>
Errors include:
- row 11: messy handwriting? (1 or 7?)
- rows 28 & 29: values swapped for shoot_7
- row 49: missing decimal point
- row 51: discrepancy in plant ID
- row 61: missing data in Eric’s version
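Fixes like these are best applied with a short script rather than by hand-editing, so every change is recorded. Here is a sketch of the row 49 fix using a toy stand-in table (the real code would target shoot_3 in the eric data frame):

```r
library(dplyr)

# toy stand-in for the entered data around Excel row 49
entered <- tibble(row = 48:51, shoot_3 = c(1.8, 21, 3.0, 5.4))

# scripted fix for the missing decimal point at row 49: 21 becomes 2.1
resolved <- entered |>
  mutate(shoot_3 = if_else(row == 49, shoot_3 / 10, shoot_3))
```

Because the fix lives in code, it is documented, reviewable, and re-runnable if the raw data are ever re-exported.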
#after fixing data-entry errors, we get `tea_resolved.csv`
tea <- read_csv("data/tea_resolved.csv")

Rows: 60 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): field, counter
dbl (10): plant_id, shoot_1, shoot_2, shoot_3, shoot_4, shoot_5, shoot_6, s...
date (1): date
time (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Explore data summaries
- You can’t check for errors if you don’t get to know your data!
- Use skimr::skim() to get a nicely formatted summary
- Look for the number of unique values for categorical variables
- Look for long tails or strange patterns in the mini-histograms for numeric variables
skimr::skim(tea)

| Name | tea |
| Number of rows | 60 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| difftime | 1 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| field | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| counter | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-07-01 | 2016-07-03 | 2016-07-02 | 3 |
Variable type: difftime
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time | 0 | 1 | 22260 secs | 23460 secs | 22920 secs | 6 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| plant_id | 0 | 1.00 | 10.50 | 5.81 | 1.0 | 5.75 | 10.50 | 15.25 | 20.0 | ▇▇▇▇▇ |
| shoot_1 | 0 | 1.00 | 6.32 | 2.49 | 1.5 | 4.70 | 5.95 | 8.17 | 12.0 | ▃▇▆▇▂ |
| shoot_2 | 0 | 1.00 | 6.72 | 3.75 | 2.2 | 3.95 | 6.35 | 7.58 | 15.2 | ▇▇▁▁▃ |
| shoot_3 | 0 | 1.00 | 6.25 | 3.74 | 1.7 | 3.08 | 5.10 | 8.53 | 16.6 | ▇▃▅▁▁ |
| shoot_4 | 4 | 0.93 | 5.47 | 3.36 | 1.2 | 2.70 | 4.85 | 6.85 | 12.9 | ▇▆▃▁▃ |
| shoot_5 | 0 | 1.00 | 5.44 | 2.44 | 1.8 | 3.70 | 5.00 | 6.55 | 11.5 | ▆▇▆▁▂ |
| shoot_6 | 0 | 1.00 | 6.04 | 3.56 | 0.0 | 3.88 | 4.85 | 7.50 | 14.1 | ▂▇▃▁▂ |
| shoot_7 | 4 | 0.93 | 6.62 | 4.11 | 1.1 | 3.35 | 5.50 | 9.95 | 16.2 | ▇▇▃▅▁ |
| hoppers | 0 | 1.00 | 2.84 | 2.77 | 0.0 | 1.00 | 2.00 | 4.00 | 15.0 | ▇▃▁▁▁ |
| leaves | 0 | 1.00 | 36.67 | 9.51 | 30.0 | 30.00 | 30.00 | 50.00 | 50.0 | ▇▁▁▁▃ |
Or get a more detailed breakdown by running skim() on a grouped data frame:
tea |>
group_by(date) |>
  skim()

| Name | group_by(tea, date) |
| Number of rows | 60 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| difftime | 1 |
| numeric | 10 |
| ________________________ | |
| Group variables | date |
Variable type: character
| skim_variable | date | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| field | 2016-07-01 | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| field | 2016-07-02 | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| field | 2016-07-03 | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| counter | 2016-07-01 | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
| counter | 2016-07-02 | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| counter | 2016-07-03 | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: difftime
| skim_variable | date | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| time | 2016-07-01 | 0 | 1 | 22740 secs | 23100 secs | 22920 secs | 2 |
| time | 2016-07-02 | 0 | 1 | 22440 secs | 23160 secs | 22800 secs | 2 |
| time | 2016-07-03 | 0 | 1 | 22260 secs | 23460 secs | 22860 secs | 2 |
Variable type: numeric
| skim_variable | date | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| plant_id | 2016-07-01 | 0 | 1.00 | 10.50 | 5.92 | 1.0 | 5.75 | 10.50 | 15.25 | 20.0 | ▇▇▇▇▇ |
| plant_id | 2016-07-02 | 0 | 1.00 | 10.50 | 5.92 | 1.0 | 5.75 | 10.50 | 15.25 | 20.0 | ▇▇▇▇▇ |
| plant_id | 2016-07-03 | 0 | 1.00 | 10.50 | 5.92 | 1.0 | 5.75 | 10.50 | 15.25 | 20.0 | ▇▇▇▇▇ |
| shoot_1 | 2016-07-01 | 0 | 1.00 | 6.06 | 2.50 | 1.5 | 4.32 | 6.00 | 7.78 | 11.0 | ▅▅▇▃▃ |
| shoot_1 | 2016-07-02 | 0 | 1.00 | 6.34 | 2.61 | 1.5 | 4.65 | 6.00 | 8.17 | 11.5 | ▃▇▅▅▃ |
| shoot_1 | 2016-07-03 | 0 | 1.00 | 6.54 | 2.47 | 1.7 | 5.05 | 5.90 | 8.50 | 12.0 | ▁▇▂▅▂ |
| shoot_2 | 2016-07-01 | 0 | 1.00 | 6.43 | 3.73 | 2.2 | 3.32 | 6.10 | 7.58 | 14.0 | ▇▇▂▁▅ |
| shoot_2 | 2016-07-02 | 0 | 1.00 | 6.77 | 3.80 | 2.4 | 3.95 | 6.40 | 7.52 | 15.1 | ▇▇▁▁▃ |
| shoot_2 | 2016-07-03 | 0 | 1.00 | 6.94 | 3.90 | 2.3 | 4.00 | 6.55 | 7.60 | 15.2 | ▇▇▁▁▃ |
| shoot_3 | 2016-07-01 | 0 | 1.00 | 5.97 | 3.74 | 1.7 | 2.90 | 4.70 | 8.45 | 15.3 | ▇▂▃▂▁ |
| shoot_3 | 2016-07-02 | 0 | 1.00 | 6.22 | 3.78 | 1.8 | 3.17 | 4.95 | 8.55 | 15.5 | ▇▃▃▂▁ |
| shoot_3 | 2016-07-03 | 0 | 1.00 | 6.54 | 3.86 | 2.1 | 3.18 | 5.20 | 8.48 | 16.6 | ▇▂▃▁▁ |
| shoot_4 | 2016-07-01 | 1 | 0.95 | 5.07 | 3.35 | 1.2 | 2.50 | 3.90 | 6.35 | 12.0 | ▇▆▅▁▃ |
| shoot_4 | 2016-07-02 | 1 | 0.95 | 5.51 | 3.40 | 1.5 | 2.85 | 4.70 | 6.60 | 12.6 | ▇▇▃▁▃ |
| shoot_4 | 2016-07-03 | 2 | 0.90 | 5.84 | 3.49 | 1.7 | 3.20 | 5.40 | 7.07 | 12.9 | ▇▆▂▁▃ |
| shoot_5 | 2016-07-01 | 0 | 1.00 | 5.10 | 2.42 | 1.8 | 3.35 | 4.55 | 6.43 | 11.0 | ▇▇▇▁▂ |
| shoot_5 | 2016-07-02 | 0 | 1.00 | 5.38 | 2.50 | 2.2 | 3.65 | 5.00 | 6.55 | 11.4 | ▆▇▃▁▂ |
| shoot_5 | 2016-07-03 | 0 | 1.00 | 5.84 | 2.45 | 2.4 | 4.60 | 5.70 | 6.70 | 11.5 | ▅▇▃▁▂ |
| shoot_6 | 2016-07-01 | 0 | 1.00 | 5.76 | 3.36 | 1.9 | 3.82 | 4.60 | 6.60 | 13.5 | ▇▇▁▂▂ |
| shoot_6 | 2016-07-02 | 0 | 1.00 | 6.14 | 3.62 | 1.0 | 4.08 | 4.90 | 7.50 | 13.9 | ▃▇▂▁▂ |
| shoot_6 | 2016-07-03 | 0 | 1.00 | 6.20 | 3.85 | 0.0 | 3.95 | 5.00 | 8.03 | 14.1 | ▂▇▃▁▂ |
| shoot_7 | 2016-07-01 | 1 | 0.95 | 6.22 | 4.15 | 1.1 | 3.15 | 5.00 | 9.65 | 15.4 | ▇▆▂▃▂ |
| shoot_7 | 2016-07-02 | 1 | 0.95 | 6.58 | 4.15 | 1.6 | 3.35 | 5.50 | 9.90 | 15.8 | ▇▆▂▃▂ |
| shoot_7 | 2016-07-03 | 2 | 0.90 | 7.09 | 4.20 | 1.4 | 4.07 | 5.95 | 10.47 | 16.2 | ▇▇▂▅▁ |
| hoppers | 2016-07-01 | 0 | 1.00 | 3.26 | 3.12 | 0.0 | 2.00 | 2.00 | 4.00 | 15.0 | ▇▃▁▁▁ |
| hoppers | 2016-07-02 | 0 | 1.00 | 2.90 | 1.62 | 0.0 | 2.00 | 3.00 | 4.00 | 6.0 | ▅▃▇▅▃ |
| hoppers | 2016-07-03 | 0 | 1.00 | 2.35 | 3.33 | 0.0 | 0.00 | 2.00 | 3.25 | 15.0 | ▇▂▁▁▁ |
| leaves | 2016-07-01 | 0 | 1.00 | 30.00 | 0.00 | 30.0 | 30.00 | 30.00 | 30.00 | 30.0 | ▁▁▇▁▁ |
| leaves | 2016-07-02 | 0 | 1.00 | 30.00 | 0.00 | 30.0 | 30.00 | 30.00 | 30.00 | 30.0 | ▁▁▇▁▁ |
| leaves | 2016-07-03 | 0 | 1.00 | 50.00 | 0.00 | 50.0 | 50.00 | 50.00 | 50.00 | 50.0 | ▁▁▇▁▁ |
Explore data visually
- visdat::vis_guess() can help spot inconsistencies
- I’ll change one of the field values to a number to demonstrate
#change field in the 10th row to "1"
tea$field[10] <- 1
#doesn't change the type of the column
class(tea$field)

[1] "character"
#but vis_guess() spots the mistake!
visdat::vis_guess(tea)

Potential mistakes spotted:
- a numeric value in the field column
- a decimal in the hoppers column
Data validation pipelines with pointblank
library(pointblank)

pointblank provides 6 (six!) workflows for validating data. The Data Quality Reporting Workflow (VALID-1) is probably the most useful for this group.
Start with a data frame, create an “agent”, tell it what to expect of your data with validation functions, and let it “interrogate” your data
The output is an HTML table with buttons to download CSV files of any data that didn’t pass your validations
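In miniature, that workflow looks like this. The sketch below uses pointblank’s built-in small_table demo data; the column (d) and threshold (100) are arbitrary choices for illustration:

```r
library(pointblank)

agent <-
  create_agent(tbl = small_table) |>  # 1. start with a data frame, create an agent
  col_vals_gt(d, value = 100) |>      # 2. state what you expect of the data
  interrogate()                       # 3. run the checks

agent  # printing the agent renders the HTML validation report
```

The real example below builds up an agent for the tea data the same way, just with more validation steps.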
pointblank demo
- Decide on “action levels”. Can set a number or fraction of rows as a threshold for a warning or error
al <- action_levels(warn_at = 1, stop_at = .05)
al

-- The `action_levels` settings
WARN failure threshold of 1 test units.
STOP failure threshold of 0.05 of all test units.
----
- Create agent
agent <-
create_agent(
tbl = tea, #our data example from before
actions = al
  )

Specify validation conditions
- Basic checks on column types with col_is_*() functions
- Check column values with col_vals_*() functions
- Check rows (e.g. duplicate rows) with rows_*() functions
agent_informed <-
agent |>
col_is_character(c(field, counter)) |> #should be character
col_is_date(date) |> #should be a date
col_vals_in_set(counter, c("W", "G", "E")) |> #The three researchers
col_vals_lt( #expect shoots < 15cm
columns = starts_with("shoot_"),
value = 15,
na_pass = TRUE
) |>
  rows_distinct(columns = vars(plant_id, date)) #no duplicate plant_id/date combinations

pointblank historically used the vars() function to specify columns (e.g. col_is_character(vars(field))), but is moving toward a more tidyverse-like approach. However, at the time of writing, some functions and arguments, like rows_distinct() above, still require vars().
- Interrogate!
agent_informed |> interrogate()

[Pointblank Validation report: an interactive HTML table summarizing all 12 validation steps. The column-type checks (steps 1–3) and rows_distinct() (step 12) passed on every test unit. The col_vals_in_set() check on counter failed for 1 of 60 rows (WARN), and three of the col_vals_lt() checks on shoot columns failed for 2–3 of 60 rows each (WARN, with two crossing the STOP threshold).]
Click the blue “CSV” buttons above to download a .csv file of just the rows that failed that particular validation
Flexible validations
If a validation function you need doesn’t exist, you can use col_vals_expr()
For example, let’s add a validation to check that hoppers is always a whole number.
agent_informed <-
agent_informed |>
col_vals_expr(~ hoppers %% 1 == 0)
agent_informed |> interrogate()

[Pointblank Validation report: the same 12 steps as before, plus step 13, col_vals_expr(), which failed for 1 of 60 rows (WARN): one hoppers value is not a whole number.]
Check that shoots are growing
The tea shoots should be consistently growing (within measurement error), so we can use col_vals_increasing() to check that shoot height always goes up. We use the segments argument to specify that this check happens within each plant_id.
agent_informed <-
agent_informed |>
#just with one shoot for demo purposes, but could use starts_with("shoot_")
col_vals_increasing(shoot_1, segments = vars(plant_id))
agent_informed |> interrogate()

[Pointblank Validation report: the previous 13 steps plus one col_vals_increasing() step per plant_id segment (steps 14–33, 3 test units each). Most segments passed; several failed for 1 of 3 units (WARN and STOP), flagging shoots that appear to shrink between days.]
Publishing validation reports
Students, faculty, and staff at the University of Arizona have access to RStudio Connect, which allows you to publish an RMarkdown document to the web with a single click. (Learn More)
Data validation can be automated in a variety of ways. If you are interested in more advanced applications of data validation for your lab, contact us!
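For instance, a scheduled script can interrogate the data and branch on the result with pointblank’s all_passed() function (sketched here with the built-in small_table demo data):

```r
library(pointblank)

# interrogate the data as part of a scheduled script
agent <-
  create_agent(tbl = small_table) |>
  col_vals_not_null(a) |>
  interrogate()

# all_passed() is TRUE only if every validation step passed completely;
# otherwise the script can warn, email the report, or halt a pipeline
if (!all_passed(agent)) {
  warning("Data validation failed: inspect the agent report before proceeding")
}
```
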
Fixing mistakes
For true mistakes in data entry (paper -> spreadsheet), it’s probably OK to just edit the raw data
For other errors, best practice:
- Don’t edit raw data!
- Record all changes to raw data (e.g. by using an R script to make them)
- Flag observations that have been changed
- Publish raw data, cleaning steps/scripts, and “cleaned” data
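A sketch of that workflow, with toy data and a made-up flag column to record which observation was changed:

```r
library(dplyr)

# toy raw data: 210 is a suspected data-entry slip for 2.10
raw <- tibble(id = 1:3, value = c(4.2, 210, 5.1))

# the raw data frame stays untouched; changes are scripted and flagged
cleaned <- raw |>
  mutate(
    value_changed = id == 2,                      # flag the edited observation
    value = if_else(id == 2, value / 100, value)  # reproducible, documented fix
  )
```

Publishing raw, script, and cleaned versions together lets anyone audit exactly what was changed and why.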
Keep an eye out for future workshops on data wrangling/tidying/cleaning
Help
Feel free to drop by the CCT Data Science Team office hours, which happen every Tuesday morning. We would love to help you with your R questions about data validation and more!
You can also make an appointment with Eric to discuss this content and get troubleshooting help.